Deep convolutional neural networks (CNNs) have proven highly effective for visual recognition, where learning a universal representation from the activations of a convolutional layer is a fundamental problem. In this paper, we present Fisher Vector encoding with Variational Auto-Encoder (FV-VAE), a novel deep architecture that quantizes the local activations of a convolutional layer in a deep generative model, trained in an end-to-end manner. To incorporate the FV encoding strategy into deep generative models, we introduce the Variational Auto-Encoder model, which performs variational inference and learning in a neural network and can be straightforwardly optimized with standard stochastic gradient methods. Unlike the FV derived from conventional generative models (e.g., the Gaussian Mixture Model), which parsimoniously fit a discrete mixture model to the data distribution, the proposed FV-VAE is more flexible in representing the natural properties of data, leading to better generalization. Extensive experiments are conducted on three public datasets, i.e., UCF101, ActivityNet, and CUB-200-2011, in the context of video action recognition and fine-grained image classification, respectively. Superior results are reported when compared to state-of-the-art representations. Most remarkably, our proposed FV-VAE achieves to date the best published accuracy of 94.2% on UCF101.
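As a rough illustration of the idea described above — treating the gradient of a VAE's reconstruction term with respect to its parameters as a Fisher-vector-style descriptor aggregated over a convolutional layer's local activations — the following is a minimal, non-authoritative numpy sketch. All dimensions, variable names, and the single-linear-layer encoder/decoder are hypothetical simplifications, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: D-dim local conv activations, K-dim latent code,
# N local descriptors extracted from one feature map.
D, K, N = 8, 4, 16

# Toy VAE parameters (single linear maps for brevity).
W_enc_mu = rng.normal(0, 0.1, (K, D))
W_enc_logvar = rng.normal(0, 0.1, (K, D))
W_dec = rng.normal(0, 0.1, (D, K))

def fv_vae_encode(X):
    """Aggregate per-descriptor gradients of the reconstruction loss
    w.r.t. the decoder weights into one Fisher-vector-like vector."""
    grad_accum = np.zeros_like(W_dec)
    for x in X:
        mu = W_enc_mu @ x
        logvar = W_enc_logvar @ x
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(K)
        x_hat = W_dec @ z
        # Gradient of 0.5 * ||x - x_hat||^2 w.r.t. W_dec.
        grad_accum += np.outer(x_hat - x, z)
    v = grad_accum.ravel() / N
    # L2-normalize, as is common practice for Fisher Vectors.
    return v / (np.linalg.norm(v) + 1e-12)

X = rng.normal(size=(N, D))   # stand-in for conv-layer activations
rep = fv_vae_encode(X)
print(rep.shape)              # (D*K,) = (32,)
```

In the full method the VAE is trained end-to-end and the gradient representation is taken over the learned model; the sketch only shows the shape of the aggregation, not the training.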